Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor
Authors
Abstract
Many websites on the Internet are multilingual and can be considered sources of parallel corpora. This paper describes Bitextor, a free/open-source tool created to harvest aligned bitexts from such multilingual websites; these bitexts may then be used to train corpus-based machine translation systems. The tool builds on previous approaches, with modifications and improvements intended to make it as adaptable as possible, so that it can process any kind of website and work with any pair of languages. The paper describes the content-based and URL-based heuristics and algorithms applied to identify and align the parallel webpages in a website and, finally, presents some results that show the functionality of the application and set out future lines of work for the project.
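As a rough illustration of what a URL-based heuristic can look like, the sketch below groups page URLs that differ only in a language marker in the path and reports them as candidate pairs. It is a minimal example under stated assumptions, not Bitextor's actual algorithm: the names `LANG_CODES`, `normalize_url` and `candidate_pairs`, the rule that the language marker is a whole path segment, and the `example.org` URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical set of language markers to look for in URL paths (an assumption).
LANG_CODES = {"en", "es", "ca", "fr", "de"}


def normalize_url(url):
    """Return (language, key), where key is the URL with its language segment removed.

    Assumes the language marker is a whole path segment such as /en/ or /es/.
    Returns language None when no such segment is found.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    lang = None
    kept = []
    for seg in segments:
        if lang is None and seg.lower() in LANG_CODES:
            lang = seg.lower()  # the first matching segment is taken as the language
        else:
            kept.append(seg)
    return lang, parts.netloc + "/" + "/".join(kept)


def candidate_pairs(urls, lang_a, lang_b):
    """Pair URLs whose normalized keys match and whose languages are lang_a and lang_b."""
    by_key = defaultdict(dict)
    for url in urls:
        lang, key = normalize_url(url)
        if lang is not None:
            by_key[key][lang] = url
    return sorted(
        (langs[lang_a], langs[lang_b])
        for langs in by_key.values()
        if lang_a in langs and lang_b in langs
    )


if __name__ == "__main__":
    pages = [
        "http://example.org/en/about.html",
        "http://example.org/es/about.html",
        "http://example.org/en/news.html",
        "http://example.org/contact.html",
    ]
    for pair in candidate_pairs(pages, "en", "es"):
        print(pair)
```

A URL match of this kind only proposes candidates; a content-based check (for example, comparing document structure or text length) would still be needed to confirm that two pages are translations of each other.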
Similar resources
Bitextor, a free/open-source software to harvest translation memories from multilingual websites
Bitextor is a free/open-source application for harvesting translation memories from multilingual websites. It downloads all the HTML files in a website, preprocesses them into a coherent format and, finally, applies a set of heuristics to select pairs of files which are candidates to contain the same text in two different languages (bitexts). From these parallel texts, translation memories are ...
Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding
Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and such data is very beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embedding, and our model relies only on a small-scale bilingual lexicon. Our approach benefits from the recent advances of continuous word representations, which ...
Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...
Preparation and exploitation of bilingual texts
A bitext is a merged document composed of two versions of a given text, usually in two different languages. An aligned bitext is produced by an alignment tool or aligner, that automatically aligns or matches the versions of the same text, generally sentence by sentence. A multilingual aligned corpus or collection of aligned bitexts, when consulted with a search tool, can be extremely useful for...
IUCL: Combining Information Sources for SemEval Task 5
We describe the Indiana University system for SemEval Task 5, the L2 writing assistant task, as well as some extensions to the system that were completed after the main evaluation. Our team submitted translations for all four language pairs in the evaluation, yielding the top scores for English-German. The system is based on combining several information sources to arrive at a final L2 translat...
Journal title:
- Prague Bull. Math. Linguistics
Volume 93, Issue
Pages -
Publication date 2010